Fix silent data corruption in JIT eltwise kernel for i8/u8 bitwise ops with broadcast #34639

Open
goyaladitya05 wants to merge 2 commits into openvinotoolkit:master from goyaladitya05:fix/jit_eltwise_bitwise_i8_broadcast

Conversation

@goyaladitya05
@goyaladitya05 goyaladitya05 commented Mar 11, 2026

Fixed a data corruption bug in load_vector() where broadcasting i8/u8 values during bitwise operations produced incorrect results.

Cause

load_vector() has two broadcast paths based on whether src_prc == dst_prc:

  • src_prc != dst_prc: calls load_scalar to widen the value to 32 bits first, then broadcasts with uni_vbroadcastss - correct, the value is already 32-bit by the time it is broadcast.
  • src_prc == dst_prc: also called uni_vbroadcastss unconditionally - wrong for 8-bit types.

vbroadcastss copies 4 bytes at a time. For an i8 value, only byte 0 of each 4-byte lane gets the scalar; the other 3 bytes are zeroed. In a 256-bit register that means 8 correct bytes and 24 zeros, so any bitwise AND/OR/XOR operating on those lanes silently produces wrong results.
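The lane arithmetic above can be checked with a small software model. This is only an illustrative sketch, not the plugin's JIT code: the `Ymm` alias and the helper names `broadcast_dword`, `broadcast_byte`, `bitwise_and`, and `count_bytes` are hypothetical, and the register is modeled as a plain 32-byte array rather than real AVX2 state.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Software model of a 256-bit vector register as 32 bytes (hypothetical alias).
using Ymm = std::array<std::uint8_t, 32>;

// Models the old path for an 8-bit source: the scalar sits in byte 0 of a
// zeroed 32-bit lane, and that whole lane is replicated (as vbroadcastss does).
Ymm broadcast_dword(std::uint8_t value) {
    Ymm reg{};  // value-initialized: all bytes are zero
    for (std::size_t i = 0; i < reg.size(); i += 4) {
        reg[i] = value;
    }
    return reg;
}

// Models the fixed path: every byte lane receives the scalar (as vpbroadcastb does).
Ymm broadcast_byte(std::uint8_t value) {
    Ymm reg{};
    reg.fill(value);
    return reg;
}

// Byte-wise AND of two modeled registers.
Ymm bitwise_and(const Ymm& a, const Ymm& b) {
    Ymm out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<std::uint8_t>(a[i] & b[i]);
    }
    return out;
}

// Counts how many bytes of the register equal `value`.
std::size_t count_bytes(const Ymm& reg, std::uint8_t value) {
    std::size_t n = 0;
    for (std::uint8_t b : reg) {
        n += (b == value) ? 1 : 0;
    }
    return n;
}
```

With 32 data bytes of 0xFF and a scalar of 0x0F, AND against `broadcast_dword(0x0F)` leaves 8 bytes equal to 0x0F and 24 bytes equal to zero, while AND against `broadcast_byte(0x0F)` yields 0x0F in all 32 bytes.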

Fix

In the src_prc == dst_prc branch, dispatch on src_prc.size() instead of always calling uni_vbroadcastss:

  • 1 byte (i8/u8)

    • AVX2+: vpbroadcastb - fills all byte lanes directly.
    • SSE4.1: punpcklbw + punpcklbw + pshufd 0 - SSE has no byte-broadcast instruction; two unpacks interleave the byte with itself, then pshufd splats it across all dword lanes.
  • 2 bytes

    • AVX2+: vpbroadcastw.
    • SSE4.1: punpcklwd + pshufd 0.
  • 4 bytes (i32/f32): uni_vbroadcastss is unchanged.
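To see why the SSE4.1 sequences splat correctly, the three instructions can be emulated on a 16-byte array. This is a sketch under stated assumptions, not the emitted kernel code: `Xmm`, `pshufd0`, `splat_byte_sse`, `splat_word_sse`, and `all_bytes_equal` are hypothetical names, the scalar is assumed to start in the low byte (or low word) of an otherwise zeroed register, and only the `imm8 == 0` form of `pshufd` is modeled.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Software model of a 128-bit register as 16 bytes (hypothetical alias).
using Xmm = std::array<std::uint8_t, 16>;

// punpcklbw dst, src: interleave the low 8 bytes of dst and src.
Xmm punpcklbw(const Xmm& dst, const Xmm& src) {
    Xmm r{};
    for (std::size_t i = 0; i < 8; ++i) {
        r[2 * i] = dst[i];
        r[2 * i + 1] = src[i];
    }
    return r;
}

// punpcklwd dst, src: interleave the low 4 words (16-bit) of dst and src.
Xmm punpcklwd(const Xmm& dst, const Xmm& src) {
    Xmm r{};
    for (std::size_t i = 0; i < 4; ++i) {
        r[4 * i] = dst[2 * i];
        r[4 * i + 1] = dst[2 * i + 1];
        r[4 * i + 2] = src[2 * i];
        r[4 * i + 3] = src[2 * i + 1];
    }
    return r;
}

// pshufd with imm8 == 0: every dword lane receives dword 0 of the source.
Xmm pshufd0(const Xmm& src) {
    Xmm r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        r[i] = src[i % 4];
    }
    return r;
}

// 1-byte splat: punpcklbw + punpcklbw + pshufd 0.
Xmm splat_byte_sse(std::uint8_t value) {
    Xmm x{};
    x[0] = value;         // scalar in byte 0, rest zero
    x = punpcklbw(x, x);  // bytes 0..1 hold the scalar
    x = punpcklbw(x, x);  // bytes 0..3 hold the scalar
    return pshufd0(x);    // dword 0 copied to all four dword lanes
}

// 2-byte splat: punpcklwd + pshufd 0.
Xmm splat_word_sse(std::uint16_t value) {
    Xmm x{};
    x[0] = static_cast<std::uint8_t>(value & 0xFF);  // little-endian low byte
    x[1] = static_cast<std::uint8_t>(value >> 8);    // high byte
    x = punpcklwd(x, x);  // words 0..1 hold the scalar
    return pshufd0(x);    // dword 0 copied to all four dword lanes
}

// Returns true when every byte of the register equals `value`.
bool all_bytes_equal(const Xmm& reg, std::uint8_t value) {
    for (std::uint8_t b : reg) {
        if (b != value) {
            return false;
        }
    }
    return true;
}
```

In this model, each unpack of a register with itself doubles the number of low bytes that hold the scalar, so the byte case needs two unpacks to fill a dword while the word case needs one.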

Tests

Added smoke_CompareWithRefs_2D_Bitwise_i8u8_Broadcast to eltwise.cpp.

  • 24 test cases: AND / OR / XOR × i8 / u8 × CONSTANT / PARAMETER secondary input.
  • Shapes: two pairs - {1,64} vs {1,1} and {32,256} vs {1,1} - from the bug report. Each pair runs inference twice (full shape, then the {1,1} broadcast operand) to exercise the fixed path.
  • 2D only, no format constraints: unlike the existing 4D bitwise suite, which tests nhwc/nchw layout permutations, 2D tensors have no channel-last layout, so no CPUSpecificParams format is set. This keeps the test focused purely on broadcast correctness.
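For orientation, the expected values for a {1,1} broadcast operand can be described by a scalar reference like the one below. This is a hedged sketch only: `BitwiseOp` and `reference_bitwise_broadcast` are hypothetical names and are not part of the OpenVINO test framework, which computes its own reference results.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical enum naming the three operations covered by the test suite.
enum class BitwiseOp { And, Or, Xor };

// Applies `op` between every element of `data` and a single broadcast scalar,
// which is what a {1,1} second operand means for any first-operand shape.
template <typename T>
std::vector<T> reference_bitwise_broadcast(const std::vector<T>& data, T scalar, BitwiseOp op) {
    std::vector<T> out;
    out.reserve(data.size());
    for (T v : data) {
        int r = 0;  // 8-bit operands are promoted to int before the operation
        switch (op) {
            case BitwiseOp::And: r = v & scalar; break;
            case BitwiseOp::Or:  r = v | scalar; break;
            case BitwiseOp::Xor: r = v ^ scalar; break;
        }
        out.push_back(static_cast<T>(r));  // truncate back to the element type
    }
    return out;
}
```

Every output element depends on the same scalar, so a kernel that places the scalar in only one byte out of four produces mismatches that a comparison against such a reference would catch.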

Closes #34638

AI Assistance:

  • AI assistance used: yes
  • If yes, summarize how AI was used and what human validation was performed (build/tests/manual checks): Used Claude Sonnet 4.6 to help pinpoint the location of the bug and to help with the fixes.
    Built it locally and verified that everything works.

@github-actions github-actions bot added the category: CPU OpenVINO CPU plugin label Mar 11, 2026
@sys-openvino-ci sys-openvino-ci added the ExternalPR External contributor label Mar 11, 2026
@goyaladitya05 goyaladitya05 marked this pull request as ready for review March 12, 2026 07:23
@goyaladitya05 goyaladitya05 requested review from a team as code owners March 12, 2026 07:23
Copilot AI review requested due to automatic review settings March 12, 2026 07:23
Contributor

Copilot AI left a comment

Pull request overview

Fixes a silent data corruption issue in the Intel CPU plugin’s x64 JIT eltwise kernel when broadcasting i8/u8 scalars for bitwise ops, and adds a focused regression test to cover the broadcast scenario from the reported bug.

Changes:

  • Update jit_uni_eltwise_generic::load_vector() to use byte/word-aware broadcast for src_prc == dst_prc (avoids vbroadcastss for 8-bit types).
  • Add a new 2D bitwise broadcast instantiation to validate i8/u8 AND/OR/XOR correctness when one operand is {1,1}-broadcast.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File | Description
src/plugins/intel_cpu/src/nodes/kernels/x64/jit_uni_eltwise_generic.cpp | Fixes scalar broadcast emission for 8-bit element types in the JIT load path used by bitwise ops.
src/plugins/intel_cpu/tests/functional/custom/single_layer_tests/instances/common/eltwise.cpp | Adds a regression test suite covering i8/u8 bitwise ops with {1,1} broadcast in 2D shapes.

Comment on lines +562 to +569
case 2:
    if (isa == x64::sse41) {
        punpcklwd(xmm_src, xmm_src);
        pshufd(xmm_src, xmm_src, 0);
    } else {
        vpbroadcastw(vmm_src, xmm_src);
    }
    break;

Copilot AI Mar 12, 2026

[MEDIUM] load_vector() adds a 2-byte broadcast path (case 2), but this code calls load_scalar() first, and load_scalar() currently throws for src_prc == dst_prc with src_prc.size() == 2 (it only supports sizes 1 and 4 in that branch). As a result, the new 2-byte broadcast logic is effectively unreachable and any future attempt to broadcast u16/i16 without type conversion will still fail at runtime. Either add 2-byte support to load_scalar() for the src_prc == dst_prc case (load 16 bits and clear upper bits) or remove the case 2 handling here to avoid implying support that isn't actually implemented.

Suggested change
case 2:
    if (isa == x64::sse41) {
        punpcklwd(xmm_src, xmm_src);
        pshufd(xmm_src, xmm_src, 0);
    } else {
        vpbroadcastw(vmm_src, xmm_src);
    }
    break;

@maxnick maxnick added this to the 2026.1 milestone Mar 12, 2026
@maxnick
Contributor

maxnick commented Mar 12, 2026

build_jenkins

@goyaladitya05
Author

Hi @maxnick, do I need to make any further changes to this? The CI failures seem infra-related.

Labels

category: CPU OpenVINO CPU plugin ExternalPR External contributor

Development

Successfully merging this pull request may close these issues.

[Bug]: bitwise_and / bitwise_or / bitwise_xor return incorrect values for int8/uint8 when broadcasting

4 participants